The purpose of this assignment is to develop a data-driven solution for a real estate company seeking to invest in the Nashville area. Through data cleansing, model building, evaluation, and recommendation, the aim is to create a predictive model that accurately assesses property values and identifies overpricing or underpricing. By comparing modeling techniques such as logistic regression, decision trees, random forests, gradient boosting, and neural networks, the assignment seeks to determine the most suitable approach for the problem at hand. Additionally, exploration of ensemble modeling techniques aims to enhance predictive accuracy and robustness. Ultimately, the assignment aims to provide actionable insights that enable the real estate company to make informed investment decisions and maximize returns in the dynamic Nashville real estate market.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.model_selection import cross_val_score
from sklearn import metrics
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, roc_auc_score, recall_score
from sklearn.metrics import mean_squared_error
from sklearn import datasets
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
import datetime as dt
import warnings
warnings.filterwarnings("ignore")
housingData = pd.read_csv('Nashville_housing_data_2013_2016.csv')
Since we are looking to invest in the growing Nashville area and to build a model that accurately finds the best deals, we will only look at the data within the Nashville area.
# Dataset of Nashville area
housingData = housingData[housingData['Property City'] == 'NASHVILLE']
housingData.shape
(40280, 31)
housingData.info()
<class 'pandas.core.frame.DataFrame'>
Index: 40280 entries, 0 to 56635
Data columns (total 31 columns):
 #   Column                             Non-Null Count  Dtype
---  ------                             --------------  -----
 0   Unnamed: 0.1                       40280 non-null  int64
 1   Unnamed: 0                         40280 non-null  int64
 2   Parcel ID                          40280 non-null  object
 3   Land Use                           40280 non-null  object
 4   Property Address                   40280 non-null  object
 5   Suite/ Condo #                     5341 non-null   object
 6   Property City                      40280 non-null  object
 7   Sale Date                          40280 non-null  object
 8   Sale Price                         40280 non-null  int64
 9   Legal Reference                    40280 non-null  object
 10  Sold As Vacant                     40280 non-null  object
 11  Multiple Parcels Involved in Sale  40280 non-null  object
 12  Owner Name                         19953 non-null  object
 13  Address                            20700 non-null  object
 14  City                               20700 non-null  object
 15  State                              20700 non-null  object
 16  Acreage                            20700 non-null  float64
 17  Tax District                       20700 non-null  object
 18  Neighborhood                       20700 non-null  float64
 19  image                              20106 non-null  object
 20  Land Value                         20700 non-null  float64
 21  Building Value                     20700 non-null  float64
 22  Total Value                        20700 non-null  float64
 23  Finished Area                      19089 non-null  float64
 24  Foundation Type                    19088 non-null  object
 25  Year Built                         19089 non-null  float64
 26  Exterior Wall                      19089 non-null  object
 27  Grade                              19089 non-null  object
 28  Bedrooms                           19078 non-null  float64
 29  Full Bath                          19176 non-null  float64
 30  Half Bath                          19074 non-null  float64
dtypes: float64(10), int64(3), object(18)
memory usage: 9.8+ MB
The dataset, restricted to properties located in Nashville, consists of 40280 samples and 31 variables. We will use only 17 variables for this assignment: Land Use, Sale Price, Sold As Vacant, Multiple Parcels Involved in Sale, Acreage, Tax District, Land Value, Building Value, Total Value, Finished Area, Foundation Type, Year Built, Exterior Wall, Grade, Bedrooms, Full Bath, and Half Bath.
The dataset contains missing values, so mode and median imputation methods will be employed to handle them. Removing duplicate values and outliers will help ensure the highest data quality for modeling. A correlation matrix will be conducted to understand the relationship between the dependent variable and independent variables. Unnecessary variables will be dropped. Additionally, multicollinearity will be detected and addressed using VIF (Variance Inflation Factor).
dropVariable = ['Unnamed: 0.1', 'Unnamed: 0', 'Parcel ID', 'Property Address',
'Suite/ Condo #', 'Property City', 'Sale Date', 'Legal Reference',
'Owner Name', 'Address', 'City', 'State', 'Neighborhood','image']
housingData = housingData.drop(dropVariable, axis=1) #drop unnecessary columns
housingData.info()
<class 'pandas.core.frame.DataFrame'>
Index: 40280 entries, 0 to 56635
Data columns (total 17 columns):
 #   Column                             Non-Null Count  Dtype
---  ------                             --------------  -----
 0   Land Use                           40280 non-null  object
 1   Sale Price                         40280 non-null  int64
 2   Sold As Vacant                     40280 non-null  object
 3   Multiple Parcels Involved in Sale  40280 non-null  object
 4   Acreage                            20700 non-null  float64
 5   Tax District                       20700 non-null  object
 6   Land Value                         20700 non-null  float64
 7   Building Value                     20700 non-null  float64
 8   Total Value                        20700 non-null  float64
 9   Finished Area                      19089 non-null  float64
 10  Foundation Type                    19088 non-null  object
 11  Year Built                         19089 non-null  float64
 12  Exterior Wall                      19089 non-null  object
 13  Grade                              19089 non-null  object
 14  Bedrooms                           19078 non-null  float64
 15  Full Bath                          19176 non-null  float64
 16  Half Bath                          19074 non-null  float64
dtypes: float64(9), int64(1), object(7)
memory usage: 5.5+ MB
import plotly.express as px
fig = px.bar(housingData.isnull().sum().sort_values(ascending=False), color_discrete_sequence=["lightblue"])
fig.update_layout(showlegend=False,
xaxis_title="",
yaxis_title="Missing Value",
title={'text': "Figure 1: Number of Missing Values for each Column",
'x': 0.50,
'xanchor': 'center',
'yanchor': 'top',
'font': {'size': 14}},
margin={'t': 100})
fig.show()
Figure 1 displays the number of missing values in each column. Missing values are present in Half Bath, Bedrooms, Foundation Type, Grade, Exterior Wall, Year Built, Finished Area, Full Bath, Total Value, Building Value, Land Value, Tax District, and Acreage.
housingData.describe()
| | Sale Price | Acreage | Land Value | Building Value | Total Value | Finished Area | Year Built | Bedrooms | Full Bath | Half Bath |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 4.028000e+04 | 20700.000000 | 2.070000e+04 | 2.070000e+04 | 2.070000e+04 | 19089.000000 | 19089.000000 | 19078.000000 | 19176.000000 | 19074.000000 |
| mean | 3.663047e+05 | 0.464649 | 7.830631e+04 | 1.752922e+05 | 2.562645e+05 | 1986.433778 | 1961.542092 | 3.096918 | 1.910670 | 0.297106 |
| std | 1.081598e+06 | 0.957274 | 1.140855e+05 | 2.260268e+05 | 3.052314e+05 | 1849.770778 | 27.549181 | 0.888494 | 0.996996 | 0.500055 |
| min | 5.000000e+01 | 0.010000 | 1.000000e+02 | 0.000000e+00 | 1.000000e+02 | 0.000000 | 1799.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 1.428438e+05 | 0.180000 | 2.200000e+04 | 7.770000e+04 | 1.070000e+05 | 1251.250000 | 1945.000000 | 3.000000 | 1.000000 | 0.000000 |
| 50% | 2.300000e+05 | 0.260000 | 3.200000e+04 | 1.215000e+05 | 1.683500e+05 | 1672.000000 | 1957.000000 | 3.000000 | 2.000000 | 0.000000 |
| 75% | 3.600000e+05 | 0.450000 | 8.500000e+04 | 2.018000e+05 | 3.023000e+05 | 2293.739990 | 1974.000000 | 4.000000 | 2.000000 | 1.000000 |
| max | 5.427806e+07 | 51.340000 | 2.772000e+06 | 1.297180e+07 | 1.394040e+07 | 197988.000000 | 2017.000000 | 11.000000 | 10.000000 | 3.000000 |
Mode imputation is suitable because it efficiently handles missing values in categorical or ordinal data by replacing them with the most frequently occurring value (mode). Given that the dataset likely contains categorical variables, mode imputation ensures that missing values are replaced with the most prevalent category, preserving the integrity of the categorical features without introducing significant bias. Therefore, mode imputation is an appropriate choice for maintaining the distribution of categorical variables, including 'Half Bath', 'Bedrooms', 'Foundation Type', 'Grade', 'Exterior Wall', 'Year Built', 'Full Bath', and 'Tax District', while handling missing data efficiently.
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='most_frequent')
housingData[['Half Bath', 'Bedrooms', 'Foundation Type', 'Grade', 'Exterior Wall', 'Year Built', 'Full Bath',
'Tax District']] = imputer.fit_transform(housingData[['Half Bath', 'Bedrooms',
'Foundation Type', 'Grade', 'Exterior Wall', 'Year Built', 'Full Bath', 'Tax District']])
Median imputation is a suitable method for handling missing data when the data is skewed or contains outliers. Unlike mean imputation, which may be sensitive to outliers, median imputation is robust and less affected by extreme values. Therefore, using median imputation can help preserve the central tendency of the data and provide more accurate estimates, especially in the presence of skewed distributions or outliers.
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='median')
housingData[['Finished Area', 'Total Value', 'Building Value', 'Land Value',
'Acreage']] = imputer.fit_transform(housingData[['Finished Area',
'Total Value', 'Building Value', 'Land Value', 'Acreage']])
housingData.isnull().sum()
Land Use                             0
Sale Price                           0
Sold As Vacant                       0
Multiple Parcels Involved in Sale    0
Acreage                              0
Tax District                         0
Land Value                           0
Building Value                       0
Total Value                          0
Finished Area                        0
Foundation Type                      0
Year Built                           0
Exterior Wall                        0
Grade                                0
Bedrooms                             0
Full Bath                            0
Half Bath                            0
dtype: int64
There are 13801 duplicated rows, which should be dropped.
print('Duplicate data:', housingData.duplicated().sum())
Duplicate data: 13801
housingData = housingData.drop_duplicates() #remove rows with duplicates
print('Duplicate data:', housingData.duplicated().sum())
Duplicate data: 0
housingData.info()
<class 'pandas.core.frame.DataFrame'>
Index: 26479 entries, 0 to 56633
Data columns (total 17 columns):
 #   Column                             Non-Null Count  Dtype
---  ------                             --------------  -----
 0   Land Use                           26479 non-null  object
 1   Sale Price                         26479 non-null  int64
 2   Sold As Vacant                     26479 non-null  object
 3   Multiple Parcels Involved in Sale  26479 non-null  object
 4   Acreage                            26479 non-null  float64
 5   Tax District                       26479 non-null  object
 6   Land Value                         26479 non-null  float64
 7   Building Value                     26479 non-null  float64
 8   Total Value                        26479 non-null  float64
 9   Finished Area                      26479 non-null  float64
 10  Foundation Type                    26479 non-null  object
 11  Year Built                         26479 non-null  object
 12  Exterior Wall                      26479 non-null  object
 13  Grade                              26479 non-null  object
 14  Bedrooms                           26479 non-null  object
 15  Full Bath                          26479 non-null  object
 16  Half Bath                          26479 non-null  object
dtypes: float64(5), int64(1), object(11)
memory usage: 3.6+ MB
variables = ['Sale Price', 'Finished Area', 'Total Value', 'Building Value', 'Land Value',
'Acreage']
plt.figure(figsize=(15, 10))
for i, column in enumerate(variables, 1):
plt.subplot(2, 3, i)
sns.boxplot(x=column, data=housingData)
plt.title('Figure: Box Plots of {}'.format(column))
plt.xlabel('Data')
plt.ylabel('Values')
plt.tight_layout()
print("Figure 2: Boxplots of 6 variables showing their outliers")
plt.show()
Figure 2: Boxplots of 6 variables showing their outliers
Figure 2 above displays a combined boxplot of six variables, illustrating the outliers that need to be addressed. I selected only these six variables for removing their extreme outliers, as the other variables are either categorical or dummy variables (with only 0 and 1).
variables = ['Sale Price', 'Finished Area', 'Total Value', 'Building Value', 'Land Value',
'Acreage']
def removeOutliers(housingData, column):
Q1 = housingData[column].quantile(0.25)
Q3 = housingData[column].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
return housingData[(housingData[column] >= lower_bound) & (housingData[column] <= upper_bound)]
# Remove outliers for each variable
for var in variables:
housingData = removeOutliers(housingData, var)
# How many outliers were removed?
print("Shape after removing outliers:", housingData.shape)
plt.figure(figsize=(15, 10))
for i, column in enumerate(variables, 1):
plt.subplot(2, 3, i)
sns.boxplot(x=column, data=housingData)
plt.title('Box Plot of {}'.format(column))
plt.xlabel('Data')
plt.ylabel('Values')
plt.tight_layout()
print("Figure 3: Boxplots of 6 variables after removing extreme outliers")
plt.show()
Shape after removing outliers: (12436, 17)
Figure 3: Boxplots of 6 variables after removing extreme outliers
Figure 3 shows the boxplots for those 6 variables after removing outliers beyond the 1.5×IQR fences. Removing all outliers indiscriminately may risk losing valuable insights or distorting the data distribution. Therefore, by specifically targeting far and extreme outliers for removal, we aim to maintain a balance between data cleanliness and preserving its integrity for analysis.
variables = ['Sale Price', 'Total Value']
def removeOutliers(housingData, column):
Q1 = housingData[column].quantile(0.25)
Q3 = housingData[column].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
return housingData[(housingData[column] >= lower_bound) & (housingData[column] <= upper_bound)]
# Remove outliers for each variable
for var in variables:
housingData = removeOutliers(housingData, var)
# How many outliers were removed?
print("Shape after removing outliers:", housingData.shape)
plt.figure(figsize=(15, 10))
for i, column in enumerate(variables, 1):
plt.subplot(1, 2, i)
sns.boxplot(x=column, data=housingData)
plt.title('Box Plot of {}'.format(column))
plt.xlabel('Data')
plt.ylabel('Values')
plt.tight_layout()
print("Figure 4: Boxplots of 'Sale Price' and 'Total Value' variables after removing outliers")
plt.show()
Shape after removing outliers: (11473, 17)
Figure 4: Boxplots of 'Sale Price' and 'Total Value' variables after removing outliers
In Figure 4 above, outliers were removed from 'Sale Price' and 'Total Value' to mitigate bias when creating a new variable based on these two features.
After removing the outliers, the dataset consists of 11473 samples and 17 variables.
# Plot the distribution of Sale Price
plt.figure(figsize=(10, 6))
sns.histplot(housingData['Sale Price'], kde=True)
plt.title('Figure 5: Distribution of Sale Price')
plt.xlabel('Sale Price')
plt.ylabel('Count')
plt.show()
Figure 5 displays the distribution of Sale Price. The majority of properties in the dataset have sale prices between 100,000 and 200,000, with some reaching as high as 400,000. The distribution is unimodal and roughly bell-shaped, with a tail toward higher prices.
There is a concern that houses are being sold at prices exceeding their asking prices, prompting the need to build an appropriate model to identify whether a property is overpriced or underpriced.
# Problem: There might be a concern that houses are going over their asking prices.
salePrice = housingData['Sale Price']
totalValue = housingData['Total Value']
# Means of sale price and total value
meanSalePrice = salePrice.mean()
meanTotalValue = totalValue.mean()
# Plot bar chart
plt.bar(['Sale Price', 'Total Value'], [meanSalePrice, meanTotalValue], color=['lightblue', 'grey'])
plt.title('Figure 6: Problem - Houses are Going over their Asking Prices')
plt.ylabel('Mean Value')
plt.show()
Figure 6 illustrates the difference between the mean sale price and the mean total value of properties in Nashville. The blue bar represents the mean sale price, while the grey bar represents the mean total value. Because the mean sale price bar (blue) is notably higher than the mean total value bar (grey), it suggests that, on average, properties are selling for more than their total value. This aligns with the concern that houses are going over their asking prices.
housingData.info()
<class 'pandas.core.frame.DataFrame'>
Index: 11473 entries, 0 to 56625
Data columns (total 17 columns):
 #   Column                             Non-Null Count  Dtype
---  ------                             --------------  -----
 0   Land Use                           11473 non-null  object
 1   Sale Price                         11473 non-null  int64
 2   Sold As Vacant                     11473 non-null  object
 3   Multiple Parcels Involved in Sale  11473 non-null  object
 4   Acreage                            11473 non-null  float64
 5   Tax District                       11473 non-null  object
 6   Land Value                         11473 non-null  float64
 7   Building Value                     11473 non-null  float64
 8   Total Value                        11473 non-null  float64
 9   Finished Area                      11473 non-null  float64
 10  Foundation Type                    11473 non-null  object
 11  Year Built                         11473 non-null  object
 12  Exterior Wall                      11473 non-null  object
 13  Grade                              11473 non-null  object
 14  Bedrooms                           11473 non-null  object
 15  Full Bath                          11473 non-null  object
 16  Half Bath                          11473 non-null  object
dtypes: float64(5), int64(1), object(11)
memory usage: 1.6+ MB
# Label encoding for categorical variables in dataset
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
categoricalVar = ['Land Use', 'Sold As Vacant', 'Multiple Parcels Involved in Sale',
'Tax District', 'Foundation Type', 'Year Built', 'Exterior Wall',
'Grade', 'Bedrooms', 'Full Bath', 'Half Bath']
# Apply label encoding for each categorical column
for var in categoricalVar:
housingData[var] = label_encoder.fit_transform(housingData[var])
A sale price higher than the total value is labeled 1, indicating overpricing; a sale price lower than the total value is labeled 0, indicating underpricing.
# Difference between sale price and total value
housingData['Price Dif'] = housingData['Sale Price'] - housingData['Total Value']
# Determine over/underpricing
housingData['Over Priced'] = (housingData['Price Dif'] > 0).astype(int)
housingData['Under Priced'] = (housingData['Price Dif'] < 0).astype(int)
housingData['Price Category'] = housingData['Over Priced'] - housingData['Under Priced']
# Create a dependent variable to understand whether it is over/under the price
# Assign 0 to underpriced and 1 to overpriced
housingData['Price Category'] = housingData['Price Category'].apply(lambda x: 1 if x == 1 else 0)
matrix = housingData.corr()
f, ax = plt.subplots(figsize=(20, 13))
sns.heatmap(matrix, vmax=1, square=True, cmap="BuPu", annot=True)
plt.title('Figure 7: Correlation Matrix', fontsize=16)
plt.show()
The correlation matrix (Figure 7) shows the pairwise correlations between Price Category and other variables in the dataset.
noCorrVar = ['Acreage', 'Full Bath']
housingData = housingData.drop(noCorrVar, axis=1) #drop columns with no correlation with target variable
extremeCorrVar = ['Over Priced', 'Under Priced']
housingData = housingData.drop(extremeCorrVar, axis=1) #drop columns with extreme correlation with target variable
from statsmodels.stats.outliers_influence import variance_inflation_factor
def calc_vif(X): # Calculating VIF
vif = pd.DataFrame()
vif["variables"] = X.columns
vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif = vif.sort_values(by='VIF', ascending=False).reset_index(drop=True)
return(vif)
X = housingData.iloc[:,:-1]
print("Table 1: VIF")
calc_vif(X)
Table 1: VIF
| | variables | VIF |
|---|---|---|
| 0 | Sale Price | inf |
| 1 | Total Value | inf |
| 2 | Price Dif | inf |
| 3 | Building Value | 544.312083 |
| 4 | Land Value | 82.187833 |
| 5 | Tax District | 55.330187 |
| 6 | Finished Area | 48.499363 |
| 7 | Bedrooms | 38.764152 |
| 8 | Grade | 18.210464 |
| 9 | Land Use | 17.860565 |
| 10 | Year Built | 12.111708 |
| 11 | Exterior Wall | 2.003503 |
| 12 | Sold As Vacant | 1.503993 |
| 13 | Foundation Type | 1.407603 |
| 14 | Half Bath | 1.314381 |
| 15 | Multiple Parcels Involved in Sale | 1.254161 |
Multicollinearity is a critical consideration before employing logistic regression, as it violates the assumption of independence among predictors. To assess multicollinearity, I opted to calculate the Variance Inflation Factor (VIF), showing in Table 1. According to Husnoo (2020), variables with a VIF score exceeding 5 exhibit strong correlation. Hence, these variables are deemed highly correlated and should be omitted from the logistic regression model to mitigate multicollinearity issues.
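To make the VIF cut-off concrete, here is a small sketch with synthetic data (not the assignment's dataset): since VIF_j = 1 / (1 − R_j²), where R_j² comes from regressing feature j on the remaining features, a strongly collinear pair should show VIFs far above 5 while an independent feature stays near 1.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic illustration: x2 is almost a copy of x1, x3 is independent noise.
rng = np.random.default_rng(42)
x1 = rng.normal(size=1000)
x2 = 0.9 * x1 + rng.normal(scale=0.1, size=1000)  # strongly collinear with x1
x3 = rng.normal(size=1000)                         # uncorrelated with x1 and x2
X_toy = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

vifs = [variance_inflation_factor(X_toy.values, i) for i in range(X_toy.shape[1])]
# x1 and x2 receive VIFs far above the cut-off of 5; x3 stays close to 1.
```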
dropVar = ['Sale Price', 'Total Value', 'Price Dif', 'Building Value', 'Land Value',
'Tax District', 'Bedrooms', 'Grade', 'Land Use', 'Year Built']
housing = housingData.drop(dropVar, axis=1)
housing.info()
<class 'pandas.core.frame.DataFrame'>
Index: 11473 entries, 0 to 56625
Data columns (total 7 columns):
 #   Column                             Non-Null Count  Dtype
---  ------                             --------------  -----
 0   Sold As Vacant                     11473 non-null  int64
 1   Multiple Parcels Involved in Sale  11473 non-null  int64
 2   Finished Area                      11473 non-null  float64
 3   Foundation Type                    11473 non-null  int64
 4   Exterior Wall                      11473 non-null  int64
 5   Half Bath                          11473 non-null  int64
 6   Price Category                     11473 non-null  int64
dtypes: float64(1), int64(6)
memory usage: 717.1 KB
The logistic regression model indicates that three variables have a significant impact on whether a property is overpriced or underpriced: 'Sold As Vacant', 'Foundation Type', and 'Finished Area'. The accuracy of the logistic model is 0.54, suggesting that it correctly predicts the class label for 54% of the properties. The weighted average F1-score is 0.46, reflecting the trade-off between precision and recall across both classes.
housing['Price Category'].astype('category').value_counts()
Price Category
1    8312
0    3161
Name: count, dtype: int64
There are 8312 samples representing overpricing, while only 3161 samples represent underpricing.
# Visualize Imbalanced Data
fig, ax = plt.subplots()
ax.pie(
housing['Price Category'].value_counts().values,
labels=["1","0"],
autopct="%1.1f%%",
explode=(0, 0.1),
shadow=True,
colors=['lightblue', 'grey']
)
ax.set_title('Figure 8: Imbalanced Data', fontsize=12)
plt.show()
The classes 0 and 1 are imbalanced; therefore, we need to balance the data.
# Splitting dataset into trainset and testset
X = housing.drop('Price Category', axis=1)
y = housing['Price Category']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=42, stratify=y)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
((8031, 6), (3442, 6), (8031,), (3442,))
When employing stratify=y in train_test_split, the function attempts to maintain class proportions in the target variable across training and testing sets. However, in highly imbalanced datasets where one class dominates, stratification may fail to effectively balance class distributions. This occurs when there are insufficient instances of minority classes to ensure representative splits. Consequently, despite specifying stratify=y, imbalanced class distributions persist, particularly in the smaller subset. In such cases, additional techniques like oversampling and undersampling become essential to mitigate class imbalance and enhance model performance. I will eventually use the oversampling method to balance the data.
print('Labels count in y:', np.bincount(y))
print('Labels count in y_train:', np.bincount(y_train))
print('Labels count in y_test:', np.bincount(y_test))
Labels count in y: [3161 8312]
Labels count in y_train: [2213 5818]
Labels count in y_test: [ 948 2494]
I standardized both the training and testing sets and addressed the class imbalance in the dataset to adjust the disproportionate representation of different classes, thus optimizing the effectiveness of the models.
# Standardization
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
# Oversampling
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler()
X_train_balanced, y_train_balanced = ros.fit_resample(X_train, y_train)
X_test_balanced, y_test_balanced = ros.fit_resample(X_test, y_test)
X_train_balanced.shape, y_train_balanced.shape, X_test_balanced.shape, y_test_balanced.shape
((11636, 6), (11636,), (4988, 6), (4988,))
# Visualize Balanced Data
fig, ax = plt.subplots()
ax.pie(
y_train_balanced.value_counts().values,
labels=["1","0"],
autopct="%1.1f%%",
explode=(0, 0.1),
shadow=True,
colors=['lightblue', 'grey']
)
ax.set_title('Figure 9: Balanced Data', fontsize=12)
plt.show()
X.head(1)
| | Sold As Vacant | Multiple Parcels Involved in Sale | Finished Area | Foundation Type | Exterior Wall | Half Bath |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1672.0 | 0 | 0 | 0 |
# Fit the logistic regression model
model = sm.Logit(y_train_balanced, X_train_balanced)
Logit = model.fit()
print("Table 2: Logit Regression Summary")
print(Logit.summary())
Optimization terminated successfully.
Current function value: 0.675069
Iterations 5
Table 2: Logit Regression Summary
Logit Regression Results
==============================================================================
Dep. Variable: Price Category No. Observations: 11636
Model: Logit Df Residuals: 11630
Method: MLE Df Model: 5
Date: Sun, 24 Mar 2024 Pseudo R-squ.: 0.02608
Time: 19:53:08 Log-Likelihood: -7855.1
converged: True LL-Null: -8065.5
Covariance Type: nonrobust LLR p-value: 1.012e-88
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
x1 -0.3544 0.022 -16.352 0.000 -0.397 -0.312
x2 0.0161 0.020 0.790 0.430 -0.024 0.056
x3 -0.0489 0.021 -2.287 0.022 -0.091 -0.007
x4 0.0556 0.020 2.783 0.005 0.016 0.095
x5 0.0190 0.021 0.900 0.368 -0.022 0.060
x6 -0.0067 0.019 -0.347 0.729 -0.045 0.031
==============================================================================
The Pseudo R-squared value indicates the proportion of variance explained by the model. In this case, it's approximately 0.02608, suggesting a low level of explanatory power. The log-likelihood value represents the maximized value of the likelihood function for the model which is about -7855.1.
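As a quick arithmetic check, McFadden's pseudo R-squared can be recomputed from the two log-likelihoods reported in the summary (rounding of the reported values accounts for any small discrepancy):

```python
# McFadden's pseudo R-squared: 1 - LL_model / LL_null, using Table 2's values
ll_model = -7855.1   # Log-Likelihood
ll_null = -8065.5    # LL-Null
pseudo_r2 = 1 - ll_model / ll_null
print(round(pseudo_r2, 4))  # ≈ 0.0261, in line with the reported 0.02608
```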
# Get the coefficients from the logistic regression model
coefficients = Logit.params
# Get the absolute values of coefficients for feature importance
abs_coefficients = np.abs(coefficients)
# Calculate p-values
p_values = Logit.pvalues
# Get the feature names
feature_names = X.columns
# Create a DataFrame to store the results
results = pd.DataFrame({
'Feature': feature_names,
'Coefficient': coefficients,
'Absolute Coefficient': abs_coefficients,
'P-value': p_values
})
# Sort the DataFrame by absolute coefficient values
results = results.sort_values(by='Absolute Coefficient', ascending=False)
print("Table 3: Coefficient, Abs. Coefficient, and p-values of Variables in Logistic Regression")
results
Table 3: Coefficient, Abs. Coefficient, and p-values of Variables in Logistic Regression
| | Feature | Coefficient | Absolute Coefficient | P-value |
|---|---|---|---|---|
| x1 | Sold As Vacant | -0.354387 | 0.354387 | 4.205507e-60 |
| x4 | Foundation Type | 0.055575 | 0.055575 | 5.384554e-03 |
| x3 | Finished Area | -0.048890 | 0.048890 | 2.221791e-02 |
| x5 | Exterior Wall | 0.018985 | 0.018985 | 3.680185e-01 |
| x2 | Multiple Parcels Involved in Sale | 0.016054 | 0.016054 | 4.295952e-01 |
| x6 | Half Bath | -0.006700 | 0.006700 | 7.288390e-01 |
The table presents the coefficients, absolute coefficients, and p-values for each feature in the logistic regression model. Notably, "Sold As Vacant" has the highest absolute coefficient (0.354387, with a negative sign indicating a negative relationship with the outcome variable) and an extremely low p-value of 4.205507e-60 (far below 0.05), signifying high statistical significance. With p-values of 5.384554e-03 and 2.221791e-02 (both below 0.05), 'Foundation Type' and 'Finished Area' are also statistically significant. Conversely, "Half Bath" exhibits the lowest absolute coefficient (0.006700) and a high p-value of 7.288390e-01, suggesting it has little influence and no statistical significance in predicting the outcome variable. "Multiple Parcels Involved in Sale" and "Exterior Wall" also display high p-values, indicating their insignificance in the model. Therefore, three variables are significant, as follows.
# p-values less than 0.05
significant_coefficients = coefficients[p_values < 0.05]
significant_abs_coefficients = np.abs(significant_coefficients)
significant_feature_names = feature_names[p_values < 0.05]
# Sort features based on their absolute coefficients
sorted_indices = np.argsort(significant_abs_coefficients)
sorted_features = significant_feature_names[sorted_indices]
sorted_coefficients = significant_abs_coefficients[sorted_indices]
# Bar chart for feature importance
plt.figure(figsize=(10, 2))
plt.barh(sorted_features, sorted_coefficients, color='lightblue')
plt.title('Figure 10: Feature Importance (Logistic Regression)')
plt.xlabel('Absolute Coefficient')
plt.tight_layout()
plt.show()
Figure 10 illustrates the feature importance of the logistic regression, in which 'Sold As Vacant' is the most important feature, significantly ahead of 'Foundation Type' and 'Finished Area'. The coefficient of -0.354387 for 'Sold As Vacant' indicates the estimated effect of this feature on the dependent variable: holding the other features constant, a one-unit increase in "Sold As Vacant" decreases the log-odds of a property being overpriced by approximately 0.354.
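Because Logit coefficients live on the log-odds scale, exponentiating them gives an odds ratio that is often easier to communicate. A small sketch using the 'Sold As Vacant' coefficient reported in Table 3:

```python
import numpy as np

coef_sold_as_vacant = -0.354387  # coefficient from Table 3
odds_ratio = np.exp(coef_sold_as_vacant)
print(round(odds_ratio, 3))  # ≈ 0.702: each one-unit increase multiplies the
                             # odds of being overpriced by about 0.70 (a ~30% drop)
```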
# Make predictions
y_pred = Logit.predict(X_test_balanced)
# Scatterplot of Predicted vs. Actual Values
plt.scatter(y_pred, y_test_balanced, color='lightblue')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Figure 11: Scatter Plot of Predicted vs Actual Values of Logistic Regression')
plt.show()
The objective behind hyperparameter tuning for a logistic regression classifier through grid search with cross-validation is to enhance the classifier's performance by identifying the most effective combination of hyperparameters (Okamura, 2020). This iterative process systematically explores various parameter values and assesses the model's performance using cross-validation, aiming to enhance both accuracy and generalization capabilities. Moreover, the accompanying plot offers a visual representation of the cross-validation results, shedding light on the model's performance under different parameter configurations.
modelsResult = pd.DataFrame({
'Model': [],
'Accuracy': [],
'Precision': [],
'Recall': []
})
def concat_result(df, y_pred, model):
newModel = pd.DataFrame({
'Model': [model],
'Accuracy': [accuracy_score(y_pred=y_pred, y_true=y_test_balanced)],
'Precision': [precision_score(y_pred=y_pred, y_true=y_test_balanced)],
'Recall': [recall_score(y_pred=y_pred, y_true=y_test_balanced)]
})
modelsResult = pd.concat([df, newModel], axis=0, ignore_index=True)
return modelsResult
# Hyperparameters for logistic regression
lr_params = {
"penalty": ['l1', 'l2'],
"C": [0.001, 0.01, 0.1, 1, 10, 100],
"solver": ['saga', 'liblinear']
}
kf = KFold(n_splits=5, shuffle=True, random_state=42)
# GridSearchCV
clf_lr = GridSearchCV(
estimator=LogisticRegression(max_iter=1000),
param_grid=lr_params,
scoring='accuracy',
cv=kf
)
clf_lr.fit(X_train_balanced, y_train_balanced)
print(f"Best hyperparameters of Logistic Regression: \n{clf_lr.best_estimator_}")
Best hyperparameters of Logistic Regression: LogisticRegression(C=0.001, max_iter=1000, solver='saga')
# Make predictions
y_pred_lr = clf_lr.predict(X_test_balanced)
# Confusion Matrix
cm_lr = confusion_matrix(y_test_balanced,y_pred_lr)
cmap = sns.light_palette("lightblue", as_cmap=True)
sns.heatmap(cm_lr, annot=True, fmt="d", cmap=cmap)
plt.title('Figure 12: Confusion Matrix of Logistic Regression')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()
The confusion matrix provides insight into the performance of a classification model distinguishing between underpriced and overpriced properties. In this context, where 0 denotes underpriced properties and 1 indicates overpriced ones based on the difference between sale price and total value, the matrix reveals the following: The model accurately identified 365 instances as underpriced properties (0) when they were indeed underpriced, demonstrating its ability to correctly classify such cases (TN). However, it misclassified 144 instances as underpriced properties (0) when they were actually overpriced, indicating FN. Moreover, the model erroneously labeled 2129 instances as overpriced properties (1) when they were underpriced, representing false positives. On a positive note, the model successfully identified 2350 instances as overpriced properties (1) when they were indeed overpriced, showcasing its capability to detect such cases (TP). Despite its proficiency in identifying overpriced properties, the model's tendency to misclassify underpriced properties may warrant further refinement to enhance its predictive accuracy and reliability, ensuring more informed decision-making in real estate investments.
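For reference, scikit-learn's `confusion_matrix` lays counts out as [[TN, FP], [FN, TP]] (rows are true labels, columns are predictions). Using the cell counts described above (read from Figure 12), the matrix and the overall accuracy can be reconstructed:

```python
import numpy as np

# Counts read from the confusion matrix in Figure 12
cm = np.array([[365, 2129],    # true underpriced (0): 365 TN, 2129 FP
               [144, 2350]])   # true overpriced (1): 144 FN, 2350 TP
tn, fp, fn, tp = cm.ravel()
accuracy = (tn + tp) / cm.sum()
print(round(accuracy, 2))  # 0.54, matching the model's reported accuracy
```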
# Import classification report
from sklearn.metrics import classification_report
print('Table 4: Classification Report of the Logistic Regression \n')
print(classification_report(y_test_balanced,y_pred_lr))
Table 4: Classification Report of the Logistic Regression
precision recall f1-score support
0 0.72 0.15 0.24 2494
1 0.52 0.94 0.67 2494
accuracy 0.54 4988
macro avg 0.62 0.54 0.46 4988
weighted avg 0.62 0.54 0.46 4988
Table 4 shows the classification report, providing insight into the model's performance in predicting underpriced (0) and overpriced (1) properties. For underpriced properties (0), the precision is 0.72, meaning that when the model predicts a property as underpriced, it is correct 72% of the time; however, the recall is only 0.15, so the model identifies just 15% of all underpriced properties. For overpriced properties (1), the precision is 0.52, while the recall is much higher at 0.94, meaning the model successfully identifies 94% of all overpriced properties. Overall, the model's accuracy is 0.54, so it correctly predicts the class label for 54% of the properties. The weighted-average F1-score of 0.46 reflects the imbalance between the strong recall on the overpriced class and the very weak recall on the underpriced class.
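The per-class figures in Table 4 follow directly from the confusion-matrix counts. A short sketch recomputing the class-1 (overpriced) row from those counts:

```python
# Counts for class 1 (overpriced), taken from the confusion matrix above
tp, fp, fn = 2350, 2129, 144

precision = tp / (tp + fp)                           # 2350 / 4479 ≈ 0.52
recall = tp / (tp + fn)                              # 2350 / 2494 ≈ 0.94
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean ≈ 0.67

print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")
```

The results agree with the class-1 row of the report produced by `classification_report`.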
modelsResult = concat_result(modelsResult, y_pred_lr, 'Logistic Regression')
# Hyperparameters for decision tree classifier
tree_params = {
'criterion': ["gini", "entropy"],
'splitter': ["best", "random"],
'min_samples_split': [2, 3, 5]
}
# GridSearchCV
clf_tree = GridSearchCV(
estimator=DecisionTreeClassifier(),
param_grid=tree_params,
scoring='accuracy',
cv=kf
)
clf_tree.fit(X_train_balanced, y_train_balanced)
print(f"Best hyperparameters of Decision Tree model: \n{clf_tree.best_estimator_}")
Best hyperparameters of Decision Tree model: DecisionTreeClassifier(criterion='entropy')
# Extract the best decision tree estimator from GridSearchCV
best_tree = clf_tree.best_estimator_
# Get feature importances from the best decision tree model
feature_importances = best_tree.feature_importances_
feature_names = X.columns
# Sort feature importances
sorted_indices = np.argsort(feature_importances)
sorted_features = feature_names[sorted_indices]
sorted_feature_importances = feature_importances[sorted_indices]
# Plot feature importances
plt.figure(figsize=(10, 5))
plt.barh(sorted_features, sorted_feature_importances, color='lightblue')
plt.title('Figure 13: Feature Importance (Decision Tree)')
plt.xlabel('Importance')
plt.tight_layout()
plt.show()
Figure 13 illustrates the six most important features of the decision tree model: 'Finished Area', 'Exterior Wall', 'Foundation Type', 'Sold As Vacant', 'Half Bath', and 'Multiple Parcels Involved in Sale'. Notably, 'Finished Area' is nearly seven times more important than any other variable.
# Make predictions
y_pred_tree = best_tree.predict(X_test_balanced)
# Confusion Matrix
cm_tree = confusion_matrix(y_test_balanced, y_pred_tree)
cmap = sns.light_palette("lightblue", as_cmap=True)
sns.heatmap(cm_tree, annot=True, fmt="d", cmap=cmap)
plt.title('Figure 14: Confusion Matrix of Decision Tree Classifier')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()
The model correctly predicted 1685 instances (TN) as class 0 when they were indeed class 0, indicating correct classifications of properties that are not overpriced. While the model performs reasonably well in identifying overpriced properties (1052 True Positives), it struggles with misclassifying some overpriced properties as not overpriced (1442 False Negatives) and incorrectly labeling some not overpriced properties as overpriced (809 False Positives). This indicates areas where the model's predictions may be improved to enhance its overall performance in accurately identifying overpriced properties.
modelsResult = concat_result(modelsResult, y_pred_tree, 'Decision Tree Classifier')
# Hyperparameters for random forest classifier
rf_params = {
"n_estimators": [70, 90, 110],
"criterion": ['gini', 'entropy'],
'min_samples_split': [2, 3, 5]
}
# GridSearchCV
clf_rf = GridSearchCV(
estimator=RandomForestClassifier(),
param_grid=rf_params,
scoring='accuracy',
cv=kf
)
clf_rf.fit(X_train_balanced, y_train_balanced)
print(f"Best hyperparameters of Random Forest model: \n{clf_rf.best_estimator_}")
Best hyperparameters of Random Forest model: RandomForestClassifier(min_samples_split=3, n_estimators=70)
# Extract the best random forest estimator from GridSearchCV
best_rf = clf_rf.best_estimator_
# Get feature importances from the best random forest model
feature_importances = best_rf.feature_importances_
feature_names = X.columns
# Sort feature importances
sorted_indices = np.argsort(feature_importances)
sorted_features = feature_names[sorted_indices]
sorted_feature_importances = feature_importances[sorted_indices]
# Plot feature importances
plt.figure(figsize=(10, 5))
plt.barh(sorted_features, sorted_feature_importances, color='lightblue')
plt.title('Figure 15: Feature Importance (Random Forest Model)')
plt.xlabel('Importance')
plt.tight_layout()
plt.show()
Figure 15 illustrates the six most important features of the random forest model: 'Finished Area', 'Sold As Vacant', 'Exterior Wall', 'Foundation Type', 'Half Bath', and 'Multiple Parcels Involved in Sale'. Remarkably, 'Finished Area' is over eight times more important than any other variable, while 'Half Bath' and 'Multiple Parcels Involved in Sale' contribute comparatively little.
# Make predictions
y_pred_rf = best_rf.predict(X_test_balanced)
# Confusion Matrix
cm_rf = confusion_matrix(y_pred=y_pred_rf, y_true=y_test_balanced)
cmap = sns.light_palette("lightblue", as_cmap=True)
sns.heatmap(cm_rf, annot=True, fmt="d", cmap=cmap)
plt.title('Figure 16: Confusion Matrix of Random Forest Classifier')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()
Based on Figure 16, the model performed reasonably well in identifying properties that are not overpriced (TN: 1642) but struggled more with correctly identifying overpriced properties (TP: 1092), with a significant number of misclassifications in both directions (FP: 852 and FN: 1402). This suggests that while the model has some effectiveness, there is room for improvement, particularly in reducing false predictions.
modelsResult = concat_result(modelsResult, y_pred_rf, 'Random Forest Classifier')
# Hyperparameters for gradient boosting classifier
gb_params = {
'n_estimators': [50, 100, 150],
'learning_rate': [0.01, 0.1, 0.5],
'max_depth': [3, 5, 7]
}
# GridSearchCV
clf_gb = GridSearchCV(
estimator=GradientBoostingClassifier(),
param_grid=gb_params,
scoring='accuracy',
cv=kf
)
clf_gb.fit(X_train_balanced, y_train_balanced)
print(f"Best hyperparameters of Gradient Boost model: \n{clf_gb.best_estimator_}")
Best hyperparameters of Gradient Boost model: GradientBoostingClassifier(learning_rate=0.5, max_depth=7, n_estimators=150)
# Extract the best gradient boosting estimator from GridSearchCV
best_gb = clf_gb.best_estimator_
# Get feature importances from the best gradient boost model
feature_importances = best_gb.feature_importances_
feature_names = X.columns
# Sort feature importances
sorted_indices = np.argsort(feature_importances)
sorted_features = feature_names[sorted_indices]
sorted_feature_importances = feature_importances[sorted_indices]
# Plot feature importances
plt.figure(figsize=(10, 5))
plt.barh(sorted_features, sorted_feature_importances, color='lightblue')
plt.title('Figure 17: Feature Importance (Gradient Boost Model)')
plt.xlabel('Importance')
plt.tight_layout()
plt.show()
Figure 17 shows the six most important features of the gradient boosting model: 'Finished Area', 'Exterior Wall', 'Foundation Type', 'Sold As Vacant', 'Half Bath', and 'Multiple Parcels Involved in Sale'. While 'Multiple Parcels Involved in Sale' contributes little, 'Finished Area' stands out as roughly six to seven times more important than the other variables.
# Make predictions
y_pred_gb = clf_gb.predict(X_test_balanced)
# Confusion Matrix
cm_gb = confusion_matrix(y_pred=y_pred_gb, y_true=y_test_balanced)
cmap = sns.light_palette("lightblue", as_cmap=True)
sns.heatmap(cm_gb, annot=True, fmt="d", cmap=cmap)
plt.title('Figure 18: Confusion Matrix of Gradient Boosting Classifier')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()
The model correctly predicted 1612 instances (TN) as class 0 when they were indeed class 0, indicating accurate classifications of properties that are not overpriced. The model demonstrates reasonably good performance in identifying overpriced properties, as indicated by the high number of TP (1149). However, it still struggles with misclassifying some overpriced properties as not overpriced (1345 FN) and incorrectly labeling some not overpriced properties as overpriced (882 FP). These areas of misclassification highlight potential areas for improvement to enhance the model's accuracy in identifying overpriced properties more effectively.
modelsResult = concat_result(modelsResult, y_pred_gb, 'Gradient Boosting Classifier')
# Hyperparameters for neural network
nn_params = {
'hidden_layer_sizes': [(50,), (100,), (50, 50), (100, 100)],
'activation': ['relu', 'tanh', 'logistic'],
'solver': ['adam', 'sgd'],
'learning_rate_init': [0.001, 0.01, 0.1]
}
from sklearn.model_selection import RandomizedSearchCV
# RandomizedSearchCV (samples n_iter=10 of the parameter combinations)
random_search_nn = RandomizedSearchCV(
estimator=MLPClassifier(),
param_distributions=nn_params,
scoring='accuracy',
cv=kf,
n_iter=10
)
random_search_nn.fit(X_train_balanced, y_train_balanced)
print(f"Best hyperparameters of Neural Network model: \n{random_search_nn.best_estimator_}")
Best hyperparameters of Neural Network model:
MLPClassifier(hidden_layer_sizes=(100, 100), learning_rate_init=0.1,
solver='sgd')
# Make predictions
y_pred_nn = random_search_nn.predict(X_test_balanced)
# Confusion Matrix
cm_nn = confusion_matrix(y_pred=y_pred_nn, y_true=y_test_balanced)
cmap = sns.light_palette("lightblue", as_cmap=True)
sns.heatmap(cm_nn, annot=True, fmt="d", cmap=cmap)
plt.title('Figure 19: Confusion Matrix of Neural Network')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()
modelsResult = concat_result(modelsResult, y_pred_nn, 'Neural Network')
Based on the evaluation metrics, the real estate company should consider using the Logistic Regression model due to its high recall score of 0.942, indicating its effectiveness in identifying overpriced properties. Key variables such as "Sold As Vacant," "Finished Area," and "Exterior Wall" play significant roles in determining property value. Leveraging the model predictions, the company can target overpriced properties for negotiation and focus on properties with desirable characteristics to maximize value and investment returns.
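As a minimal sketch of how such predictions could be turned into a negotiation target list — the listings DataFrame, parcel IDs, prices, and prediction vector here are all hypothetical stand-ins, not outputs of the fitted model:

```python
import numpy as np
import pandas as pd

# Hypothetical listings under consideration (illustrative data only)
listings = pd.DataFrame({
    'ParcelID': ['A-101', 'B-202', 'C-303', 'D-404'],
    'SalePrice': [310000, 185000, 420000, 250000],
})

# Stand-in for model output such as clf_lr.predict(X_new); 1 = overpriced
y_pred = np.array([1, 0, 1, 0])

# Properties flagged as overpriced become candidates for price negotiation
targets = listings[y_pred == 1]
print(targets)
```

Filtering on the predicted label this way lets the company shortlist flagged parcels before deeper due diligence.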
modelsResult5 = modelsResult.sort_values(by='Accuracy', ascending=False)
print("Table 5: Accuracy, Precision, and Recall of 5 Models (Sorted by Accuracy)")
modelsResult5
Table 5: Accuracy, Precision, and Recall of 5 Models (Sorted by Accuracy)
| | Model | Accuracy | Precision | Recall |
|---|---|---|---|---|
| 4 | Neural Network | 0.562149 | 0.565845 | 0.534082 |
| 3 | Gradient Boosting Classifier | 0.553528 | 0.565731 | 0.460706 |
| 1 | Decision Tree Classifier | 0.548717 | 0.565287 | 0.421812 |
| 2 | Random Forest Classifier | 0.548115 | 0.561728 | 0.437851 |
| 0 | Logistic Regression | 0.544306 | 0.524671 | 0.942261 |
# Sort modelsResult by Recall
modelsResult_sorted = modelsResult.sort_values(by='Recall', ascending=True)
bar_width = 0.25
num_models = len(modelsResult_sorted)
index = np.arange(num_models) # Set the positions for the bars
# Plot
plt.figure(figsize=(12, 8))
plt.barh(index, modelsResult_sorted['Accuracy'], bar_width, color='skyblue', label='Accuracy')
plt.barh(index + bar_width, modelsResult_sorted['Precision'], bar_width, color='lightgreen', label='Precision')
plt.barh(index + 2*bar_width, modelsResult_sorted['Recall'], bar_width, color='grey', label='Recall')
plt.xlabel('Score')
plt.title('Figure 20: Accuracy, Precision, and Recall of 5 Models (Sorted by Recall)')
plt.yticks(index + bar_width, modelsResult_sorted['Model'])
plt.legend()
plt.show()
Figure 20 displays the accuracy, precision, and recall scores for the 5 models. These models are sorted by recall, as their accuracy and precision do not vary significantly. It is evident that logistic regression has the highest recall score compared to the other 4 models.
The ensemble model slightly outperforms the individual models in accuracy and precision but lags behind logistic regression in recall, the metric most crucial for identifying overpriced properties. While the ensemble combines the strengths of multiple models, logistic regression's high recall makes it preferable for pinpointing overpriced properties and securing better profitability for the real estate company.
from sklearn.ensemble import VotingClassifier
# Define the ensemble using VotingClassifier
ensemble_model = VotingClassifier(
estimators=[
('logistic', clf_lr),
('decision_tree', clf_tree),
('random_forest', clf_rf),
('gradient_boosting', clf_gb),
('neural_network', random_search_nn)
],
voting='hard' # majority voting
)
# Fit the ensemble model
ensemble_model.fit(X_train_balanced, y_train_balanced)
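Hard voting simply assigns each property the class predicted by a majority of the fitted models. A minimal illustration of the mechanic with toy label vectors (not the actual model outputs):

```python
import numpy as np

# Toy predictions from three hypothetical classifiers for four properties
preds = np.array([
    [1, 0, 1, 1],   # classifier A
    [1, 1, 0, 1],   # classifier B
    [0, 0, 1, 1],   # classifier C
])

# Majority vote per column: the class predicted by most classifiers wins
votes = (preds.sum(axis=0) > preds.shape[0] / 2).astype(int)
print(votes)  # → [1 0 1 1]
```

With `voting='hard'`, `VotingClassifier` applies this same majority rule across the five tuned models above; `voting='soft'` would instead average predicted probabilities.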
VotingClassifier(estimators=[('logistic', GridSearchCV(...)),
                             ('decision_tree', GridSearchCV(...)),
                             ('random_forest', GridSearchCV(...)),
                             ('gradient_boosting', GridSearchCV(...)),
                             ('neural_network', RandomizedSearchCV(...))],
                 voting='hard')
# Make predictions using the ensemble model
y_pred_ensemble = ensemble_model.predict(X_test_balanced)
modelsResult = concat_result(modelsResult, y_pred_ensemble, 'Ensemble')
modelsResult6 = modelsResult.sort_values(by='Accuracy', ascending=False)
print("Table 6: Accuracy, Precision, and Recall of 6 Models (Sorted by Accuracy)")
modelsResult6
Table 6: Accuracy, Precision, and Recall of 6 Models (Sorted by Accuracy)
| | Model | Accuracy | Precision | Recall |
|---|---|---|---|---|
| 4 | Neural Network | 0.562149 | 0.565845 | 0.534082 |
| 5 | Ensemble | 0.558741 | 0.569729 | 0.479952 |
| 3 | Gradient Boosting Classifier | 0.553528 | 0.565731 | 0.460706 |
| 1 | Decision Tree Classifier | 0.548717 | 0.565287 | 0.421812 |
| 2 | Random Forest Classifier | 0.548115 | 0.561728 | 0.437851 |
| 0 | Logistic Regression | 0.544306 | 0.524671 | 0.942261 |
The table above summarizes the performance metrics (accuracy, precision, and recall) for the 6 models, sorted by accuracy.
# Sort modelsResult by Recall
modelsResult_sorted = modelsResult.sort_values(by='Recall', ascending=True)
bar_width = 0.25
num_models = len(modelsResult_sorted) # Get the number of models
index = np.arange(num_models)
# Plot
plt.figure(figsize=(12, 8))
plt.barh(index, modelsResult_sorted['Accuracy'], bar_width, color='skyblue', label='Accuracy')
plt.barh(index + bar_width, modelsResult_sorted['Precision'], bar_width, color='lightgreen', label='Precision')
plt.barh(index + 2*bar_width, modelsResult_sorted['Recall'], bar_width, color='grey', label='Recall')
plt.xlabel('Score')
plt.title('Figure 21: Summary of Accuracy, Precision, and Recall of Models (Sorted by Recall)')
plt.yticks(index + bar_width, modelsResult_sorted['Model'])
plt.legend()
plt.show()
Figure 21 illustrates the accuracy, precision, and recall scores for the 6 models. Because their accuracy and precision scores are roughly equal, the comparison focuses on recall. As the figure shows, logistic regression achieves the highest recall, at nearly 94.23%, even though it is slightly lower in accuracy and precision than the other models.
Based on the evaluation metrics, the ensemble model performs slightly better than individual models in terms of accuracy and precision, with an accuracy of 0.559 and precision of 0.570. However, the ensemble model's recall score is lower than that of the logistic regression model, indicating that it may not be as effective in identifying overpriced properties. Nonetheless, the ensemble approach combines the strengths of multiple models, providing a more robust prediction framework. Therefore, while the ensemble model improves overall performance, the logistic regression model remains preferable for its high recall score in identifying overpriced properties.
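Since recall is the priority here, one follow-up worth noting: logistic regression exposes class probabilities, so the decision threshold can be tuned to trade precision for recall instead of relying on the default 0.5 cutoff of `predict`. A self-contained sketch on synthetic data — the dataset and threshold values are illustrative, not taken from the assignment:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the balanced housing features/labels
X, y = make_classification(n_samples=2000, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]  # P(class 1, i.e. overpriced)

# Lowering the threshold flags more properties as overpriced,
# raising recall at the cost of precision
for threshold in (0.5, 0.3):
    pred = (proba >= threshold).astype(int)
    print(f"threshold={threshold}: "
          f"precision={precision_score(y_te, pred):.2f}, "
          f"recall={recall_score(y_te, pred):.2f}")
```

For an investor who cares most about not missing overpriced listings, a lower threshold on the logistic regression probabilities is a simple lever to push recall even higher.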
Based on the evaluation of various machine learning models, including logistic regression, decision trees, random forest, gradient boost, neural network, and ensemble methods, it is evident that the models perform differently in terms of accuracy, precision, and recall. While the logistic regression model exhibits the highest recall, indicating its effectiveness in identifying overpriced properties, the ensemble model shows improved accuracy and precision. Considering the assignment's objective to develop a data-driven solution for a real estate company seeking to invest in the Nashville area, the ensemble model offers a balance between overall performance and the ability to detect overpricing. By leveraging ensemble techniques and the insights gained from model evaluation, the real estate company can make informed investment decisions and maximize returns in the dynamic Nashville real estate market.
Husnoo, A. (2020, November 13). A practical guide to logistic regression in Python for beginners. Medium. https://medium.com/analytics-vidhya/a-practical-guide-to-logistic-regression-in-python-for-beginners-f04cf6b63d33
Okamura, S. (2020, December 30). GridSearchCV for beginners. Towards Data Science. https://towardsdatascience.com/gridsearchcv-for-beginners-db48a90114ee
scikit-learn developers. (n.d.). sklearn.model_selection.GridSearchCV. scikit-learn documentation. https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html